Abstract

A common evaluation practice in the vector space models (VSMs) literature is to measure the models’ ability to predict human judgments about lexical semantic relations between word pairs. Most existing evaluation sets, however, consist of scores collected for English word pairs only, ignoring the potential impact of the judgment language in which word pairs are presented on the human scores.

In this paper we translate two prominent evaluation sets, WordSim353 (association) and SimLex999 (similarity), from English to Italian, German and Russian and collect scores for each dataset from crowdworkers fluent in its language. Our analysis reveals that human judgments are strongly impacted by the judgment language. Moreover, we show that the predictions of monolingual VSMs do not necessarily best correlate with human judgments made with the language used for model training, suggesting that models and humans are affected differently by the language they use when making semantic judgments. Finally, we show that in a large number of setups, multilingual VSM combination results in improved correlations with human judgments, suggesting that multilingualism may partially compensate for the judgment language effect on human judgments.¹


1 Introduction 

In recent years, there has been immense interest in the development of Vector Space Models (VSMs) for word meaning representation. Most VSMs are based on the distributional hypothesis (Harris, 1954), stating that words that occur in similar contexts tend to have similar meanings.

1 All the datasets and related documents produced in this work will be released upon acceptance of the paper.

VSMs produce a vector representation for each word in the lexicon. A common evaluation practice for such models is to compute a score for each member of a word pair set by applying a similarity function to the vectors of the words participating in the pair. The resulting score should reflect the degree to which one or more lexical relations between the words in the pair hold. The correlation between the model’s scores and the scores generated by human evaluators is then computed.

Humans as well as VSMs may consider various languages when making their judgments and predictions. Recent research on multilingual approaches to VSMs aims to exploit multilingual training (training with corpora written in different languages) to improve VSM predictions. The resulting models are evaluated either against human scores, most often produced for word pairs presented to the human evaluators in English, or on multilingual text mining tasks (§2). While works of the latter group do recognize the connection between the VSM training language (TL) and the task’s language, to the best of our knowledge no previous work systematically explored the impact of the judgment language (JL), the language in which word pairs are presented to human evaluators, on human semantic judgments and on their correlation with VSM predictions.

In this paper we therefore explore two open issues: (a) the effect of the JL on the human judgment of semantic relations between words; and (b) the effect of the TL(s) on the capability of VSMs to predict human judgments generated with different JLs. To address these issues we translate two prominent datasets of English word pairs scored for semantic relations: WordSim353 (WS353, (Finkelstein et al., 2001)), consisting of 353 word pairs scored for association, and SimLex999 (SL999, (Hill et al., 2014b)), consisting of 999 word pairs scored for similarity. For each dataset, the word pairs and the annotation guidelines are translated to three languages from different branches of the Indo-European language family: German (Germanic), Italian (Romance) and Russian (Slavic). We then employ the CrowdFlower crowdsourcing service² to collect judgments for each set from human evaluators fluent in its JL (§3).

In §5 we explore the hypothesis that due to a variety of factors (linguistic, cultural and others) the JL should affect human generated association and similarity scores. Indeed, our results show that inter-evaluator agreement is significantly higher within a JL than it is across JLs. This suggests that word association and similarity are JL dependent.

We then investigate (§6) the connection between the VSM TL and the human JL. We experiment with two VSMs that capture distributional co-occurrence statistics in different ways: a bag-of-words (BOW) model that is based on direct counts and the neural network (NN) based word2vec (w2v, (Mikolov et al., 2013a)). We train these models on monolingual comparable corpora from our four JLs (§4) and compare their predicted scores with the human scores produced for the various JLs. Our analysis reveals fundamental differences between word association and similarity. For example, while for association the predictions of a VSM trained on a given language best correlate with human judgments made with that language, for similarity some JLs better correlate with all monolingual models than others.

Finally (§7), we explore how multilingual model combination affects the ability of VSMs to predict human judgments for various JLs. Our results show a positive effect for a large number of TL and JL combinations, suggesting that multilingualism may partially compensate for the judgment language effect on human semantic judgments.


2 Previous Work 

Vector Space Models and Their Evaluation. Earlier VSM work (see (Turney and Pantel, 2010)) designed word representations based on word co-location counts, potentially post-processed using techniques such as Positive Pointwise Mutual Information (PPMI) and dimensionality reduction methods. Recently, much of the focus has shifted to the development of NNs for representation learning (Bengio et al., 2003; Collobert and Weston, 2008; Collobert et al., 2011; Huang et al., 2012; Mikolov et al., 2013a; Mikolov et al., 2013c; Levy and Goldberg, 2014; Pennington et al., 2014, inter alia).

2 http://www.crowdflower.com/

VSMs have been evaluated in two main forms: (a) comparing model-based word pair scores with human judgments of various semantic relations, where the model scores are generated by applying a similarity function, usually the cosine metric, to the vectors generated by the model for the words in the pair (Huang et al., 2012; Baroni et al., 2014; Levy and Goldberg, 2014; Pennington et al., 2014; Schwartz et al., 2015, inter alia); and (b) evaluating the contribution of the generated vectors to NLP applications (Collobert and Weston, 2008; Collobert et al., 2011; Pennington et al., 2014, inter alia).

Several evaluation sets consisting of English word pairs scored by humans for semantic relations (mostly association and similarity) are in use for VSM evaluation. Among these are RG-65 (Rubenstein and Goodenough, 1965), MC-30 (Miller and Charles, 1991), WS353 (Finkelstein et al., 2001), YP-130 (Yang and Powers, 2006), and SL999 (Hill et al., 2014b). Recently a few evaluation sets consisting of scored word pairs in languages other than English (e.g. Arabic, French, Farsi, German, Portuguese, Romanian and Spanish) were presented (Gurevych, 2005; Zesch and Gurevych, 2006; Hassan and Mihalcea, 2009; Schmidt et al., 201H; Camacho-Collados et al., 2015; Köper et al., 2015, inter alia). Most of these datasets, however, are translations of the English sets, where the original human scores produced for the original English set are kept. Even for those cases where evaluation sets were re-scored (e.g. (Camacho-Collados et al.,
JL effect is much more thorough.³ A comprehensive list of these datasets, as well as of evaluation sets for word relations beyond word-pair similarity and association (Mitchell and Lapata, 2008; Bruni et al., 2012; Baroni et al., 2012, inter alia), is given at http://wordvectors.org/suite.php

3 Although WS353 was translated to German and then scored with the German JL (Köper et al., 2015), we translated and scored the dataset again in order to keep the same translation and scoring decisions across our datasets. We applied the same

Multilingual VS Modeling. Recently, there has been a growing interest in multilingual vector space modeling (Klementiev et al., 2012; Lauly et al., 2013; Khapra et al., 2013; Hermann and Blunsom, 2014b; Hermann and Blunsom, 2014a; Kocisky et al., 2014; Lauly et al., 2014; Al-Rfou et al., 2013; Faruqui and Dyer, 2014; Coulmance et al., 2015, inter alia). These works train VSMs on multilingual data, either parallel or not, or combine VSMs trained on monolingual data. The resulting models are evaluated either against human scores, most often produced for word pairs presented to the human evaluators in English, or on multilingual text mining tasks.


3 Multilingual Human Judgment Data 


Here we describe the data collection process, consisting of dataset translation (§3.1) and scoring (§3.2). Our working datasets are WS353 (Finkelstein et al., 2001) and SL999 (Hill et al., 2014b).⁴


3.1 Evaluation Sets Translation 

We started by translating the WS353 and SL999 scoring guidelines to the target languages. For each language the translation was done by two native speakers, and disagreements were resolved through a discussion mediated by an experiment manager. An external evaluator, fluent in both the target language and English, then verified the translation quality.

The word pair translation process was more complicated. We followed the same protocol outlined above and further set a number of rules that guided our translators in challenging cases. Below we discuss the different types of translation ambiguities addressed in our guidelines.

Gender. In some cases English does not make gender distinctions that some of the other languages do. For example, the English word cat refers to both the female and the male cat while in Russian and Italian each gender has its own word (e.g. gatto and gatta in Italian). In such cases, if the other word in the English pair has a clear gender interpretation we followed this gender in the translation of both words, otherwise we chose one of the genders randomly and kept it fixed across the target languages.⁵

considerations when re-scoring the original English versions of WS353 and SL999.

4 The original datasets and annotation guidelines are available at http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html and http://www.cl.cam.ac.uk/~fh295/simlex.html respectively.

Word Senses. It is common that some words in a given language have a sense set that is not conveyed by any of the words of another given language. For example, the English word plane, from the WS353 pair (car,plane), has both the airplane and the geometric plane senses. However, to the best of our translators’ knowledge, no German, Italian or Russian word has these two senses.

We assume that when the authors of the evaluation sets paired two words, they referred to their closest senses. Therefore, as for gender, we used the other word in the pair for sense disambiguation. In our example, plane is translated to the target language word which has the airplane meaning (e.g. Flugzeug in German, aeroplano in Italian), since this sense is closer to the meaning of car.

In cases where the other word in the pair does not clearly disambiguate the sense of its polysemous counterpart, we randomly chose one of the latter word’s senses, and kept it fixed across the target languages. Consider, for example, the SL999 pair (portray,decide). Portray has three senses:⁶ one related to describing someone or something, one related to showing in painting and one to playing a character in a TV show, play or movie. Since it was not clear to our translators how the word decide can facilitate sense selection, we randomly chose the first sense and used it across target languages.

Sense disambiguation is done on a POS basis as well. For example, in the pair (attempt,peace) attempt can be a verb or a noun, but neither of these senses is necessarily closer to the meaning of peace. In such cases, reasoning that words with the same POS tend to have a closer meaning, we used the interpretation of the polysemous word which has the same POS as the other word in the pair. That is, in the current example attempt was assigned its noun sense, as peace is a noun. Naturally, the target language translation of a given English word may also have multiple senses, some of which are not expressed by the English word. We guided our translators to avoid such translations whenever possible, although in a few cases that was impossible.

5 We did not observe any case of gender disagreement between languages.

6 http://www.merriam-webster.com/dictionary/portray

Pair Exclusion. We excluded some of the pairs from the evaluation sets in our experiments. Three pairs were excluded from WS353 due to translation difficulties. The pairs (noon,midday) and (coast,shore) were excluded because none of the target languages includes two different words that convey the meaning of either set. The pair (football,soccer) was also excluded since it reflects a cultural distinction that is not made in the target languages. The resulting datasets in all four languages therefore consist of 350 word pairs.

For SL999 all 999 pairs were translated, scored and employed in the JL analysis of §5. However, as for 23 of the pairs at least one of the participating words did not appear in at least one of the VSM training corpora (see §4), we excluded these pairs from the analysis of the relations between the TLs and the JLs (§6 and §7).

Inter Translator Agreement. The disagreement rates between our two translators for WS353 (700 words) and SL999 (1998 words) are (left parentheses for WS353, right for SL999): Russian ((85 words, 12.1%), (353 words, 17.7%)), Italian ((57 words, 8.1%), (196 words, 9.8%)) and German ((113 words, 16.1%), (396 words, 19.8%)). To resolve disagreements, for each language we asked one of the translators to choose the translation which is more similar in meaning to the other word in the pair. If this was not possible, the translator was asked to choose the word which seemed to her more common in the target language.
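The reported disagreement percentages follow directly from the word counts. The short check below recomputes them; the dictionary and function names are illustrative and not part of the original study.

```python
# Word totals and disagreement counts as reported in the text.
TOTAL_WORDS = {"WS353": 700, "SL999": 1998}
DISAGREEMENTS = {
    "Russian": {"WS353": 85, "SL999": 353},
    "Italian": {"WS353": 57, "SL999": 196},
    "German": {"WS353": 113, "SL999": 396},
}


def disagreement_rate(language: str, dataset: str) -> float:
    """Percentage of words on which the two translators disagreed."""
    return 100.0 * DISAGREEMENTS[language][dataset] / TOTAL_WORDS[dataset]


for language in DISAGREEMENTS:
    for dataset in TOTAL_WORDS:
        rate = disagreement_rate(language, dataset)
        print(f"{language:8s} {dataset}: {rate:.1f}%")
```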


3.2 Word Pair Scoring

We next describe the word pair scoring process. In order to keep our analysis unbiased across JLs, we scored WS353 and SL999 in all four languages, including English. We divided each dataset into non-overlapping batches of 50 word pairs each (7 for WS353, 20 for SL999, with one SL999 batch consisting of 49 pairs) and employed the CrowdFlower crowdsourcing service to recruit fluent speakers of each target language to score each batch. Evaluators were presented with the scoring guidelines translated to their JL and were asked to score the pairs on a 0-10 scale.

We verified the quality of our evaluators through a three step process. First, for each JL we only recruited evaluators who were located in a country where this language is the mother tongue of the majority of the population (i.e. US, Germany, Italy or Russia). Second, in order to make sure that our evaluators understood the task properly, we generated 7 tests for each language, each consisting of two word pairs that do not appear in the evaluation set. The participating pairs consisted of words that were either very similar or very dissimilar. Before scoring a batch of word pairs, each evaluator was presented with a randomly sampled test in their language and was asked to score its word pairs. Every evaluator who assigned a similar pair a score lower than 7 or a non-similar pair a score higher than 3 was excluded from the experiment. Finally, we ran an outlier detection procedure in order to exclude evaluators whose scores were substantially different from those of the other evaluators of their batch.⁷ For each evaluator we computed the distance of their average score from the average of the other evaluators and normalized it by the standard deviation of the latter set. Evaluators whose statistic was above a predefined threshold⁸ were excluded from the final dataset. We performed this procedure periodically, and once a batch had 13 annotators that passed the test we stopped collecting scores for that batch.

4 Vector Space Models

Here we describe the VSMs we employ, their training data and evaluation protocol.

4.1 Models

Bag of Words (BOW). We constructed a VSM following the optimal performance guidelines of (Kiela and Clark, 2014). After extracting the k most frequent words in the training corpus, we generated a matrix of co-occurrence counts with a row for each of the words in any of the pairs in an evaluation set, and a column for each of the k most frequent words. Co-occurrence was counted within a window size C, without crossing sentence boundaries. The entries of the matrix were then normalized to PPMI values. The resulting matrix’s rows constitute the vector representations of the words.⁹

7 Some works that employ crowdsourcing compare some of the collected annotation to a pre-prepared gold standard. We consider our outlier detection process an alternative as it keeps only those annotators who tend to agree with the others.

8 The threshold was set to 1.45, reasoning that if the statistics were sampled from a Gaussian with the empirical mean and variance, then ~80% of the evaluators would be included.
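As a rough illustration of the BOW construction in §4.1, the sketch below counts windowed co-occurrences inside each sentence and converts them to PPMI. It is not the authors' implementation. The function names and the toy corpus are hypothetical, and the probabilities are estimated from the restricted (target rows, top-k columns) matrix, which is an assumption the text does not settle.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, targets, context_vocab, window):
    """Count (target, context) co-occurrences within `window` tokens,
    never looking outside the current sentence."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word not in targets:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in context_vocab:
                    counts[(word, sent[j])] += 1
    return counts


def ppmi_vectors(counts, targets, context_vocab):
    """Map each target word to a PPMI vector over the sorted context words."""
    total = sum(counts.values())
    row_sum, col_sum = Counter(), Counter()
    for (word, context), n in counts.items():
        row_sum[word] += n
        col_sum[context] += n
    contexts = sorted(context_vocab)
    vectors = {}
    for word in targets:
        vector = []
        for context in contexts:
            n = counts.get((word, context), 0)
            if n == 0:
                vector.append(0.0)  # unseen pair: PPMI is defined as 0
            else:
                pmi = math.log(n * total / (row_sum[word] * col_sum[context]))
                vector.append(max(0.0, pmi))
        vectors[word] = vector
    return vectors


# Toy corpus (hypothetical); k and window play the roles of k and C in the text.
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
k, window = 3, 2
frequency = Counter(word for sent in sentences for word in sent)
context_vocab = {word for word, _ in frequency.most_common(k)}
targets = {"cat", "dog"}
counts = cooccurrence_counts(sentences, targets, context_vocab, window)
vectors = ppmi_vectors(counts, targets, context_vocab)
```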

word2vec. We use the NN model of Mikolov et al. (Mikolov et al., 2013a; Mikolov et al., 2013b).¹⁰ The model aims to learn word representations that maximize the objective:

$$\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)$$

where T is the number of training tokens and c is a window size parameter. The objective respects sentence boundaries, conditioning only words from the same sentence on each other.¹¹

We tuned three parameters: D, the vector dimensionality; F, a frequency cutoff for words to be included in the objective; and c, the window size. We followed Radim Řehůřek’s w2v tutorial¹² and set c = 5, D = 400 and F = 1 for all TLs.
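To make the index set of the objective concrete, the sketch below enumerates the (w_t, w_{t+j}) pairs that contribute a log p(w_{t+j} | w_t) term, with -c ≤ j ≤ c, j ≠ 0, restricted to a single sentence. The function name and the toy sentences are illustrative only; this is not the word2vec training code.

```python
def skipgram_pairs(sentences, c):
    """Yield (center, context) pairs (w_t, w_{t+j}) with -c <= j <= c, j != 0,
    keeping only context positions that fall inside the same sentence."""
    for sent in sentences:
        for t, center in enumerate(sent):
            for j in range(-c, c + 1):
                if j != 0 and 0 <= t + j < len(sent):
                    yield center, sent[t + j]


# Two toy sentences; no pair ever crosses the boundary between them.
pairs = list(skipgram_pairs([["a", "b", "c"], ["d", "e"]], c=1))
```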


4.2 Training and Word Pair Scoring 


We trained our VSMs on the Wikipedia corpora released by (Al-Rfou et al., 2013).¹³ This is a set of multilingual comparable corpora, as Wikipedia entries covering the same topic have similar content across languages. This allows us to focus on the effect of the (TL, JL) combination, while keeping the training topics fixed across languages.

The size of these corpora is as follows (left number for the number of word types, right number for the number of word tokens): English (3.98 M, 1.4 G), German (5.1 M, 484.5 M), Italian (1.65 M, 281.6 M), Russian (2.81 M, 230 M). Before training the models, we cleaned the corpora, removing stopwords and any string that is not comprised of alphabetic characters only,¹⁴ and stemming the remaining words using an NLTK stemmer.¹⁵


9 We experimented with k ∈ {1000, 2000, ..., 10000} and C ∈ {2, 3, ..., 8} and set k = 10000 and C = 2 for all TLs.

10 http://word2vec.googlecode.com/svn/trunk/word2vec.c

11 We excluded this detail from the objective for brevity.

12 http://radimrehurek.com/2014/02/word2vec-tutorial/

13 https://sites.google.com/site/rmyeid/projects/polyglot

14 According to the NLTK list, http://www.nltk.org/

15 http://www.nltk.org/howto/stem.html


The score assigned to a word pair by a model is the cosine similarity between the vectors the model induces for the pair’s words. For each (TL, JL) pair we compute the Spearman correlation coefficient (ρ) between the ranking derived from a model’s scores and the ranking derived from the human scores.¹⁶
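The scoring protocol just described can be sketched in a few lines of standard-library Python: cosine similarity per word pair, then Spearman's ρ computed as the Pearson correlation of average ranks. The vectors, pairs and human scores below are made-up toy values, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, pairs, human_scores):
    """Correlate cosine-based model scores with human scores for the same pairs."""
    model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return spearman(model_scores, human_scores)


# Toy example with hypothetical 2-dimensional vectors and human scores.
toy_vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1],
               "plane": [0.5, 0.5], "banana": [0.0, 1.0]}
toy_pairs = [("car", "automobile"), ("car", "plane"), ("car", "banana")]
toy_human = [9.0, 6.0, 1.0]
rho = evaluate(toy_vectors, toy_pairs, toy_human)
```

In this toy case the model and the human scores order the three pairs identically, so ρ is 1 up to floating-point error.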
Our main experimental setup reflects a preference for comparable corpora. This choice has consequences: first, our English and German corpora are substantially larger than their Russian and Italian counterparts; and, second, our training corpora are smaller than some of the alternative publicly available corpora that have been used for VSM training.

To exclude the possibility that our observations are the mere outcome of these biases, we replicated the experiments of §6 and §7 in two additional setups. First, in the small training setup we re-ran our experiments when the English and the German training corpora were cut to the size of the Russian or the Italian corpus. The results in this setup were averaged over 5 random samples from each corpus. Second, in the large training setup we re-ran our experiments when the English, German and Italian corpora were replaced with much larger, non-comparable corpora: English with the 8G word tokens corpus constructed using the w2v script,¹⁷ and Italian and German with the WaCky corpora ((?),¹⁸ Italian: 1.585G word tokens, German: 1.278G word tokens).¹⁹ Since the result patterns in these setups are very similar to those in the main, comparable corpora setup, we report them only briefly.

5 The Judgment Language Effect 

Our first question is: how does the JL affect the word pair scores produced by the human evaluators? To provide a quantitative answer, we run the following protocol, both within and across JLs. For each of the 50 word pair batches, we generate all possible K-size subsets of the batch evaluators, each K-size subset defining a unique partition of these evaluators

16 Result patterns are very similar when considering the Pearson and Kendall's tau scores. We hence keep our presentation concise and report only the Spearman scores.

17 code.google.com/p/word2vec/source/browse/trunk/demo-train-big-model-v1.sh

18 http://wacky.sslmit.unibo.it/doku.php

19 Russian is not included in this latter setup since we could not find a publicly available substantially larger Russian corpus.

L1 | L2    English                         German                          Italian                         Russian
           mean            std             mean            std             mean            std             mean            std
English    0.838 | 0.896   0.083 | 0.033   0.752           0.105           0.739           0.092           0.739           0.110
German     0.648           0.187           0.808 | 0.864   0.062 | 0.055   0.700           0.105           0.720           0.076
Italian    0.729           0.084           0.633           0.197           0.879 | 0.871   0.053 | 0.055   0.720           0.121
Russian    0.724           0.097           0.621           0.170           0.705           0.073           0.880 | 0.880   0.045 | 0.033

Table 1: Average Spearman ρ correlation coefficient between human judgments in the within and the cross language setups. The (L1, L2) table entry (which is further divided into mean and standard deviation (std) columns) corresponds to the comparison of evaluators with judgment language L1 to evaluators with judgment language L2. For each pair of languages the entry above the main diagonal of the matrix is for WS353 and the entry below the main diagonal is for SL999 (for example, the (German, Italian) entry is for WS353 while the (Italian, German) entry is for SL999). On the main diagonal, for both the mean and the std entries, the left number is for SL999 while the right number is for WS353.


(a) bow - ws353
T | J      English   German    Italian   Russian
English    0.600*    0.523     0.488     0.496
German     0.387     0.414*    0.360     0.408
Italian    0.485*    0.410     0.451     0.427
Russian    0.403     0.377     0.360     0.426*

(b) w2v - ws353
T | J      English   German    Italian   Russian
English    0.652*    0.618     0.614     0.585
German     0.537     0.595*    0.505     0.554
Italian    0.564     0.483     0.569*    0.504
Russian    0.574     0.532     0.495     0.606*

(c) bow - sl999
T | J      English   German    Italian   Russian
English    0.266     0.354*    0.308     0.260
German     0.198     0.342*    0.249     0.170
Italian    0.207     0.299*    0.293     0.197
Russian    0.160     0.250*    0.242     0.234

(d) w2v - sl999
T | J      English   German    Italian   Russian
English    0.214     0.304*    0.271     0.220
German     0.086     0.268*    0.199     0.087
Italian    0.140     0.236*    0.214     0.115
Russian    0.141     0.240*    0.226     0.157

Table 2: Spearman ρ correlation coefficient between human scores and VSM scores. The (T, J) entry of each matrix presents the ρ value between the scores of a VSM trained on language T and the human scores produced for judgment language J. In each table, for each training language (row) the best judgment language is marked with an asterisk.


(we set K to 6). Then, for the within language evaluation we calculate the correlation between the averaged word pair scores of the two subsets induced by each K-size subset selection. For the cross language evaluation, in turn, we calculate the correlation between the average word pair scores of each K-size subset of language 1 with its corresponding subset of language 2. The resulting ρ scores were averaged to get a final score for each language (in the within-language case) and language pair (in the cross-language case).²⁰
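The within-language part of this protocol can be sketched as follows, assuming 13 evaluators per batch (the number stated in §3.2) and K = 6, which gives C(13, 6) = 1716 partitions. The function names are hypothetical, and the correlation function is passed in as a parameter rather than fixed. The cross-language pairing of "corresponding" subsets is not spelled out in enough detail to reproduce, so it is left out.

```python
from itertools import combinations
from math import comb


def mean_scores(scores, evaluator_ids):
    """Average, per word pair, the scores of the selected evaluators.
    scores[e][p] is evaluator e's score for word pair p."""
    n_pairs = len(scores[0])
    return [sum(scores[e][p] for e in evaluator_ids) / len(evaluator_ids)
            for p in range(n_pairs)]


def within_language_splits(n_evaluators=13, k=6):
    """Yield every partition of the evaluators into a K-subset and its complement."""
    everyone = set(range(n_evaluators))
    for subset in combinations(range(n_evaluators), k):
        yield subset, tuple(sorted(everyone - set(subset)))


def within_language_agreement(scores, correlation, k=6):
    """Mean correlation between the averaged scores of the two halves,
    taken over all K-subset partitions of one batch."""
    values = [correlation(mean_scores(scores, a), mean_scores(scores, b))
              for a, b in within_language_splits(len(scores), k)]
    return sum(values) / len(values)


n_partitions = sum(1 for _ in within_language_splits())  # equals comb(13, 6)
```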

Table 1 presents our results. The correlations within a JL are clearly higher compared to their cross JL counterparts, with mean values in the range of [0.864 - 0.896] for WS353 and [0.808 - 0.880] for SL999 within JLs, compared to [0.700 - 0.752] for WS353 and [0.621 - 0.729] for SL999 across JLs.

For both evaluation sets, we ran Welch’s t-test for each set of correlations computed for an individual language with each set of correlations computed for a pair of languages. In all 24 cases²¹ of each evaluation set the null hypothesis stating that the two sets have an equal mean was rejected with p-value < 0.001.

20 We have 1716 K-size subsets for each batch and totals of 1716*7 and 1716*20 correlations for the WS353 and SL999 scenarios respectively.
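For reference, Welch's test compares two sample means without assuming equal variances. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom with the standard library only. Obtaining the p-value additionally requires the t-distribution CDF, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`. The two samples are invented toy numbers, not the study's correlations.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unbiased variances).
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical within-language and cross-language correlation samples.
within = [0.90, 0.88, 0.91, 0.89]
cross = [0.74, 0.70, 0.78, 0.72]
t_stat, dof = welch_t(within, cross)
```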

Further, the standard deviation values are [0.033 - 0.055] for WS353 and [0.045 - 0.083] for SL999 in the within language setup, compared to [0.076 - 0.121] for WS353 and [0.073 - 0.197] for SL999 in the cross language setup. These results reflect the weaker dependence of the human judgment in the within language setup on the involved word-pairs and human evaluators.

To better understand the JL effect, for each JL we rank the word pairs according to their average human score and then compute, for each pair of JLs, the relative F-score between corresponding quintiles. The top line of Figure 1 (two left graphs for comparisons where English is involved, two right graphs for comparisons where German is involved) reveals that the overlap between corresponding quintiles is substantially larger for the top and bottom quintiles (top and bottom 20% of the word pairs according to each of the rankings) compared to quintiles 2-4. The graphs further demonstrate the larger overlap between corresponding quintiles in the within language setups compared to the cross-language setups, highlighting the impact of JL differences on this phenomenon.²³

All in all, our results suggest that the concepts of word similarity and association may be JL dependent. Our next natural question is how the relations between the VSM TL and the human JL affect the correlation between the model and the human scores.

[Figure 1 panels. Top line: Human Scores - English WS353; Human Scores - English SimLex999; Human Scores - German WS353; Human Scores - German SimLex999. Bottom line: German BOW vs. Human - WS353; German W2V vs. Human - WS353; German BOW vs. Human - SimLex999; German W2V vs. Human - SimLex999.]

Figure 1: Relative F-score of the word pair lists in corresponding quintiles of: (a) human rankings with different judgment languages (top line, two left graphs show all combinations of English with another language, two right graphs show the same for German); and (b) model rankings with training language l1 vs. human ranking with judgment language l2 (bottom line, graphs presented for all combinations of models (BOW and w2v) and evaluation sets (WS353 and SL999) for l1 = German). Languages are denoted with a one or a three letter abbreviation, M stands for model and H for human.

21 We have four languages and hence six language pairs.

22 For each of the 1716 K-size subset pairs (all possible divisions of the scores to subsets of 6 and 7 for the within language case, every subset of size 6 in one language with its corresponding subset in the other language in the cross language case), we produced two word pair rankings according to the average score within each subset. We then divided each ranked list into 5 quintiles and computed relative F-scores between each pair of corresponding quintiles. We finally report the average F-score for each pair of corresponding quintiles across all these 1716 cases.

23 We performed the same analysis for the cases where the Italian and Russian JLs are involved and observed very similar patterns. These graphs are omitted due to space constraints.
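The quintile comparison can be illustrated with a small sketch. The paper does not define the "relative F-score" formally, so the code assumes it is the standard F1 between the two sets of word pairs that fall into the same quintile of each ranking. With equal-sized quintiles this reduces to the fraction of shared pairs. All names are hypothetical.

```python
def quintiles(scores):
    """Split word-pair indices into 5 consecutive groups, ordered from the
    highest-scored pairs to the lowest-scored ones."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(ranked)
    bounds = [round(q * n / 5) for q in range(6)]
    return [set(ranked[bounds[q]:bounds[q + 1]]) for q in range(5)]


def f_score(reference, candidate):
    """Standard F1 between two sets (assumed reading of 'relative F-score')."""
    overlap = len(reference & candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def quintile_f_scores(scores_a, scores_b):
    """F-score of each pair of corresponding quintiles in two rankings."""
    return [f_score(qa, qb)
            for qa, qb in zip(quintiles(scores_a), quintiles(scores_b))]


# Ten toy word pairs scored by two hypothetical sources.
same = quintile_f_scores(list(range(10)), list(range(10)))
reverse = quintile_f_scores(list(range(10)), list(range(9, -1, -1)))
```

Identical rankings agree on every quintile, while fully reversed rankings agree only on the middle one.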

6 The VSM Training Language Effect 

Table 2 presents the Spearman ρ correlation coefficient between human and model scores.

Training Language Choice. For each of the JLs J, we first ask which TL T leads to the monolingual model that best predicts human judgments with J. Both word association (WS353) and similarity (SL999) demonstrate very similar answers to this question.

A first shared pattern is that English is overall the 
best choice of TL for both BOW and w2v: in 7 out 
of 8 cases for WS353 and in all 8 cases for SL999. 
A second shared pattern is that the JL itself is overall 
the second best TL, which is observed in 10 of the 
11 cases where English is the best TL for a given JL 
and JL != English. 
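The WS353 count above can be re-derived from the Table 2(a) and 2(b) values by taking, for each judgment language (column), the training language (row) with the highest ρ. The snippet below is only a bookkeeping aid with illustrative names. It copies the table values as printed.

```python
LANGS = ["English", "German", "Italian", "Russian"]

# Rows: training language T; columns: judgment language J (Table 2a and 2b).
BOW_WS353 = [
    [0.600, 0.523, 0.488, 0.496],
    [0.387, 0.414, 0.360, 0.408],
    [0.485, 0.410, 0.451, 0.427],
    [0.403, 0.377, 0.360, 0.426],
]
W2V_WS353 = [
    [0.652, 0.618, 0.614, 0.585],
    [0.537, 0.595, 0.505, 0.554],
    [0.564, 0.483, 0.569, 0.504],
    [0.574, 0.532, 0.495, 0.606],
]


def best_training_language(table):
    """For each judgment language (column), return the training language
    (row) whose model has the highest correlation."""
    best = {}
    for j, judgment_language in enumerate(LANGS):
        column = [table[t][j] for t in range(len(LANGS))]
        best[judgment_language] = LANGS[column.index(max(column))]
    return best


english_wins = sum(
    training_language == "English"
    for table in (BOW_WS353, W2V_WS353)
    for training_language in best_training_language(table).values()
)
```

The single exception is the w2v model with the Russian JL, where the Russian-trained model scores highest.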

Judgment Language Choice. Our second question is complementary to the first one, namely, for each of the TLs T, what is the JL J that leads to human judgments that best correlate with the predictions of the monolingual model trained with T. Here we observe considerable differences between word similarity and association.

A first major difference is in the identity of the best JL. While for WS353 in 7 of 8 cases a model trained with a TL T best correlates with human judgments made with T as a JL, for SL999 both models best correlate with German judgments for all TLs.

A second major difference is related to the English JL. For WS353 in 3 out of the 5 cases where English is not the best JL for a model it is the second best JL. For SL999, in contrast, for all 8 TL and model type combinations, English is the JL with the lowest correlation. For this dataset, Italian is always the second best JL, and Russian is the third best.


VSM Comparison. Our experiments also cast light on the effectiveness of the participating VSMs. For every combination of TL, JL and word pair dataset, the NN-based w2v is superior to the count-based BOW. This finding supports recent conclusions on the superiority of "predict" models over their "count" counterparts (Baroni et al., 2014).


Quintile Analysis. To further investigate the mu¬ 
tual impact of the TLs and JLs, we replicated the 
quintile analysis of § [5] this time comparing the 
rankings of a model trained with language l\ to the 
human scores obtained with JL Res ults are pre¬ 
sented in the bottom line of Figure 


25 


Interestingly, like in the respective analysis for
JL pairs, human-model disagreement is generally
most prominent for word pairs that are considered
of medium similarity or association. Note, however,
that in the current analysis, the human-model agreement
is weaker than the human-human agreement
on the corresponding quintiles we explored in § 5.
Moreover, while in the analysis of § 5 the F-score
values in the within language setup are superior to
their cross-language counterparts, here keeping the
TL and JL identical does not result in superior
F-scores in most cases.
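
The quintile comparison discussed here can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function names `quintiles` and `quintile_f_scores` are hypothetical, ties are broken by pair id, and a plain F1 over the overlap of corresponding quintiles is assumed in place of the "relative" F-score, whose exact normalization is defined in § 5.

```python
def quintiles(scores):
    """Rank pair ids by descending score and cut the ranking into 5 parts."""
    ranked = sorted(scores, key=lambda k: (-scores[k], k))  # ties broken by id
    n = len(ranked)
    bounds = [round(i * n / 5) for i in range(6)]
    return [set(ranked[bounds[i]:bounds[i + 1]]) for i in range(5)]


def quintile_f_scores(model_scores, human_scores):
    """Plain F1 between corresponding model and human quintiles."""
    result = []
    for m_q, h_q in zip(quintiles(model_scores), quintiles(human_scores)):
        overlap = len(m_q & h_q)
        precision = overlap / len(m_q) if m_q else 0.0
        recall = overlap / len(h_q) if h_q else 0.0
        denom = precision + recall
        result.append(2 * precision * recall / denom if denom else 0.0)
    return result


if __name__ == "__main__":
    # Toy scores for 10 hypothetical word pairs (ids 0..9), not data from the paper.
    human = {i: 10 - i for i in range(10)}  # stands in for the averaged human scores
    model = dict(human)
    model[3], model[4] = human[4], human[3]  # swap two mid-ranked pairs
    print(quintile_f_scores(model, human))
```

With equally sized quintiles, precision and recall coincide, so each value is simply the fraction of word pairs that the model and the annotators place in the same quintile.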


24 Since the models produce only one score per word pair, in
this analysis we ranked the word pairs according to the model
scores as well as according to the average of all 13 human
scores, divided each ranked list into quintiles and computed a relative
F-score for each pair of corresponding quintiles.

25 For brevity, we present only the curves for $l_1$ = German;
the patterns for the other cases are very similar.


Observations. Our analysis leads to several observations.
First, word similarity and association judgments
have a language specific component. Consequently,
the JL is a good choice for model training
(first question) and the predictions of models
trained on a given language are best correlated with
human judgments performed with that language, at
least for word association (second question). While
this seems obvious in machine learning terms, as in-domain
training is preferable and language change is
analogous to domain change, the semantic nature of
our tasks would suggest that VSMs should preserve
their outcome across languages. Our results suggest
that this latter assumption is not true.

Second, English has a special status in VSM research:
as a VSM TL for both association and similarity
prediction (first question), and as a JL for word
association. The special status of English as a TL
may result from its simpler morphology,²⁶ which
may allow more robust statistics to be collected. Another
possible explanation is that our evaluators are
likely to have some command of English,²⁷ which
may bias their semantic judgments towards those
made by an English trained model.

The JL pattern is harder to understand. One possible
hypothesis is that the dominance of English for
word association is the result of our evaluation sets
being translations of sets originally authored in English.
Consequently, some important meaning components
may get lost in translation. However, the
poor similarity predictions of both models with all
four TLs when English is the JL seriously challenge
this hypothesis.

Finally, for word similarity both VSMs are much
better correlated with human scores when the JL is
German compared to the other JLs and particularly
to English. We will investigate this surprising observation
in future work.

Training Corpus Size Effect. In the small train¬ 
ing setup our results were very similar to the re¬ 
sults reported above both in terms of qualitative pat¬ 
terns and in the numerical correlation values (up to 
0.02 difference in Spearman p). In the large train- 

26 This is reflected, for example, by the lower type-to-token
ratio of English in our training corpora: English = 0.0028;
German = 0.011; Italian = 0.0058; Russian = 0.012.

27 We have not checked this. 










T | J    English   German    Italian   Russian
E-G      0.544     0.528     0.490     0.504
E-I      0.575     0.531     0.516     0.517
E-R      0.556     0.526     0.494     0.515
G-I      0.504     0.466     0.479     0.482
G-R      0.437     0.450     0.407     0.473
I-R      0.508     0.445     0.475     0.481
E-E      0.543     0.518     0.484     0.492
G-G      0.342     0.400     0.353     0.402
I-I      0.46      0.400     0.443     0.416
R-R      0.395     0.365     0.355     0.420

(a) BOW - WS353


T | J    English   German    Italian   Russian
E-G      0.177     0.334     0.273     0.180
E-I      0.201     0.313     0.276     0.192
E-R      0.209     0.318     0.289     0.217
G-I      0.131     0.294     0.238     0.119
G-R      0.135     0.288     0.243     0.145
I-R      0.164     0.281     0.256     0.164
E-E      0.210     0.302     0.267     0.215
G-G      0.078     0.259     0.194     0.078
I-I      0.137     0.235     0.209     0.110
R-R      0.137     0.235     0.223     0.150

(c) BOW - SL999


T | J    English   German    Italian   Russian
E-G      0.675     0.667     0.630     0.630
E-I      0.696     0.633     0.662     0.621
E-R      0.687     0.645     0.624     0.646
G-I      0.657     0.629     0.625     0.623
G-R      0.618     0.628     0.553     0.645
I-R      0.657     0.585     0.606     0.644
E-E      0.652     0.620     0.609     0.591
G-G      0.540     0.601     0.497     0.563
I-I      0.582     0.500     0.582     0.514
R-R      0.571     0.534     0.498     0.610

(b) w2v - WS353


T | J    English   German    Italian   Russian
E-G      0.263     0.392     0.313     0.244
E-I      0.267     0.371     0.340     0.260
E-R      0.233     0.332     0.302     0.274
G-I      0.242     0.380     0.319     0.223
G-R      0.212     0.338     0.284     0.242
I-R      0.207     0.311     0.303     0.250
E-E      0.261     0.353     0.311     0.254
G-G      0.196     0.344     0.248     0.169
I-I      0.206     0.307     0.300     0.195
R-R      0.170     0.256     0.260     0.241

(d) w2v - SL999


Table 3: Spearman $\rho$ correlation coefficient between human scores and the outcome of a linear interpolation (LI) of the scores of
pairs of monolingual models. The (T = $L_1$-$L_2$, J = $L_3$) entry of each table is the correlation of (1) the outcome of an LI of the
scores of monolingual models trained on languages $L_1$ and $L_2$ with (2) the human scores produced with the JL $L_3$. Cases where
the LI of $L_1$ and $L_2$ outperforms a monolingual model trained on $L_3$ (where $L_3$ is the JL) are highlighted in bold.


T | J    English   German    Italian   Russian
E-G      -0.130    -0.071    -0.070    -0.121
E-I      -0.068    -0.033    -0.011    -0.070
E-R      -0.099    -0.043    -0.026    -0.090
G-I      -0.076    -0.052    -0.038    -0.081
G-R      -0.128    -0.068    -0.075    -0.126
I-R      -0.062    -0.034    -0.032    -0.084
E-E      -0.112    -0.049    -0.034    -0.103
G-G      -0.14     -0.078    -0.107    -0.129
I-I      -0.06     -0.036    -0.006    -0.059
R-R      -0.112    -0.059    -0.039    -0.103

(a) BOW - WS353


T | J    English   German    Italian   Russian
E-G      0.222     0.234     0.273     0.253
E-I      0.232     0.214     0.260     0.236
E-R      0.270     0.240     0.289     0.277
G-I      0.199     0.211     0.249     0.216
G-R      0.212     0.206     0.246     0.228
I-R      0.223     0.192     0.239     0.226
E-E      0.244     0.226     0.274     0.258
G-G      0.149     0.185     0.215     0.182
I-I      0.183     0.175     0.209     0.176
R-R      0.191     0.189     0.221     0.198

(c) BOW - SL999


T | J    English   German    Italian   Russian
E-G      0.340     0.356     0.312     0.273
E-I      0.295     0.289     0.302     0.259
E-R      0.274     0.291     0.261     0.286
G-I      0.286     0.311     0.291     0.268
G-R      0.284     0.339     0.267     0.307
I-R      0.223     0.218     0.230     0.261
E-E      0.28      0.251     0.23      0.226
G-G      0.206     0.291     0.212     0.219
I-I      0.228     0.227     0.205     0.147
R-R      0.236     0.252     0.253     0.297

(b) w2v - WS353


T | J    English   German    Italian   Russian
E-G      0.319     0.434     0.374     0.296
E-I      0.312     0.408     0.395     0.297
E-R      0.296     0.361     0.355     0.320
G-I      0.289     0.417     0.369     0.257
G-R      0.275     0.366     0.330     0.287
I-R      0.262     0.331     0.346     0.287
E-E      0.332     0.399     0.380     0.310
G-G      0.251     0.394     0.310     0.237
I-I      0.250     0.353     0.358     0.217
R-R      0.240     0.294     0.309     0.288

(d) w2v - SL999


Table 4: Spearman $\rho$ correlation coefficient of the scores resulting from a CCA combination of monolingual models, with corresponding
human scores. Table entries and highlighting are as in Table 3.


ing setup we observed the exact same patterns de¬ 
tailed above but the Spearman p values for every 
(TL, JL) pair were higher than those of Table 2, with
the $\rho$ differences having the following (mean, std)
values: BOW/WS353: (0.036, 0.025), w2v/WS353:
(0.031, 0.013), BOW/SL999: (0.048, 0.046) and
w2v/SL999: (0.042, 0.033).


Our final investigation is of the potential of monolingual
VSM combination to compensate for the JL
effect.























































































































































7 The Multilingual Combination Effect 

We explore two simple methods for the combination
of VSMs trained on corpora of different languages,
$l_1$ and $l_2$. In the first method, linear interpolation
(LI), we combine the scores produced by two VSMs
for a word pair $(w_i, w_j)$ using the linear equation:

$$Score(w_i, w_j) = \lambda \cdot sc_{l_1}(w_i, w_j) + (1 - \lambda) \cdot sc_{l_2}(w_i, w_j)$$

where $sc_{l_i}(w_i, w_j)$ is the score produced by the
model trained on the $l_i$ language and $\lambda \in [0, 1]$.²⁸
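
As a rough illustration of the interpolation step and of how its output can be compared with human scores, consider the following Python sketch. It is not the implementation used in the experiments: the function names (`interpolate`, `average_ranks`, `spearman`) and the toy numbers are hypothetical, and Spearman's correlation is computed directly as the Pearson correlation of average ranks so that the example needs no external library.

```python
def interpolate(scores_l1, scores_l2, lam=0.5):
    """Score(w_i, w_j) = lam * sc_l1(w_i, w_j) + (1 - lam) * sc_l2(w_i, w_j)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(scores_l1, scores_l2)]


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one ranking is constant
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy numbers for four hypothetical word pairs, not data from the paper.
    sc_l1 = [0.9, 0.2, 0.6, 0.4]   # scores of the model trained on l1
    sc_l2 = [0.7, 0.5, 0.3, 0.1]   # scores of the model trained on l2
    human = [9.0, 4.0, 6.5, 2.0]   # human scores for one JL
    combined = interpolate(sc_l1, sc_l2, lam=0.5)
    print(combined, spearman(combined, human))
```

Because Spearman's correlation depends only on ranks, any differences in scale between the two models' scores matter for the interpolation itself but not for the final evaluation.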
Our second combination method is Canonical
Correlation Analysis (CCA). For each pair of languages,
($l_1$, $l_2$), we calculated a pair of projection
matrices to the shared subspace through the CCA
method (Hardoon et al., 2004), using the vectors induced
by monolingual models trained on an $l_1$ corpus and
an $l_2$ corpus. We then constructed a multilingual
vector representation for each word by concatenating
the $l_1$ and $l_2$ projected representations.


We compare the performance of each multilingual
combination method to a monolingual baseline
in which the predictions of two monolingual models,
each trained on a randomly sampled 80% of the
same monolingual training corpus, were combined
using one of the above methods.³¹ This is done in
order to rule out the possibility that our improvements
are the mere result of the smoothing effect
that model combination provides.
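
The sampling step of this baseline could look roughly as follows. This is a sketch under stated assumptions: the sampling unit is taken to be the sentence, the helper names `sample_corpus` and `baseline_sample_pairs` are hypothetical, and the five repetitions are interpreted as five independent pairs of samples, which the text does not spell out. Each returned sample would then be used to train one monolingual model before the two models' predictions are combined.

```python
import random


def sample_corpus(sentences, fraction=0.8, seed=0):
    """Return a random subset holding `fraction` of the sentences, in corpus order."""
    rng = random.Random(seed)
    k = int(round(fraction * len(sentences)))
    chosen = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in chosen]


def baseline_sample_pairs(sentences, n_repeats=5, fraction=0.8):
    """Yield (sample_a, sample_b) pairs; each sample would train one model."""
    for r in range(n_repeats):
        yield (sample_corpus(sentences, fraction, seed=2 * r),
               sample_corpus(sentences, fraction, seed=2 * r + 1))


if __name__ == "__main__":
    corpus = ["sentence %d" % i for i in range(10)]  # placeholder corpus
    for a, b in baseline_sample_pairs(corpus, n_repeats=2):
        # sizes of both samples and the number of sentences they share
        print(len(a), len(b), len(set(a) & set(b)))
```

Two 80% samples of the same corpus necessarily share at least 60% of it, so the two baseline models differ far less than two models trained on different languages; this is what lets the baseline isolate the smoothing effect of combination.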

Table 3 (top 6 lines of each table) presents results
for multilingual LI. The numbers clearly show that
this is an effective method of combining two monolingual
models, leading to improvements over the
participating monolingual models in most dataset
and model combinations.³² Improvements computed
with respect to monolingual models trained on
the JL (TL = JL, i.e. the results on the main diagonals
of the sub-tables of Table 2) are more prominent
for German, Italian and Russian than for English
(highlighted in bold in Table 3), which is not
surprising given that English is the best TL of monolingual
VSMs for the majority of evaluation set, JL
and model combinations (§ 6). Multilingual interpolated
models improve over such non-interpolated
monolingual models in 68 of 96 cases (70.8%).

28 We experimented with $\lambda \in \{0.25, 0.33, 0.5, 0.67, 0.75\}$
and got improvements for most combinations of TL pairs, JLs
and $\lambda$ values (see below). We report results with $\lambda = 0.5$, giving both
monolingual models an equal weight.

29 Following (Faruqui and Dyer, 2014) we also experimented
with taking one of the monolingual projected vectors as the multilingual
representation and got very similar results.

30 We applied both protocols for the combination of three and
four monolingual models and did not observe substantial improvements
over two-language multilingual models.

31 These results were averaged over 5 random samples from
each of the corpora. For LI we naturally used $\lambda = 0.5$. For CCA
we employed the same protocol as in multilingual combination.

32 This effect is not highlighted in the table but is evident from
a comparison to the numbers reported in Table 2.

Comparison to monolingual LI (bottom 4 lines of
each table) reveals the impact of the multilingual
combination. As an example indication, monolingual
LI improves over monolingual models trained
on the JL in only 18 of 64 cases (28.1%).³³
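
The denominators 96 and 64 are consistent with counting one case per table cell, i.e. per (row, JL, model, dataset) combination. The short Python check below spells out that reading, which is our interpretation of the table layout and is not stated explicitly in the text; the counts 68 and 18 are taken from the text and the variable names are illustrative.

```python
from itertools import combinations

LANGS = ["English", "German", "Italian", "Russian"]
MODELS = ["BOW", "w2v"]
DATASETS = ["WS353", "SL999"]

# Rows of each sub-table: six language pairs and four same-language baselines.
multilingual_rows = list(combinations(LANGS, 2))   # E-G, E-I, E-R, G-I, G-R, I-R
monolingual_rows = [(lang, lang) for lang in LANGS]  # E-E, G-G, I-I, R-R

# One case per (row, JL, model, dataset) cell.
n_multi = len(multilingual_rows) * len(LANGS) * len(MODELS) * len(DATASETS)
n_mono = len(monolingual_rows) * len(LANGS) * len(MODELS) * len(DATASETS)

share_multi = 68 / n_multi  # 68 improved cases, as reported in the text
share_mono = 18 / n_mono    # 18 improved cases, as reported in the text

print(n_multi, share_multi)
print(n_mono, share_mono)
```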

Interestingly, CCA combination improves over
monolingual models trained on the JL only for
SL999. This result adds mixed observations to previous
positive results on the effect of CCA combination
for multilingual VSM construction with the English
JL (Faruqui and Dyer, 2014) and for the combination
of visual and textual representations (Silberer
and Lapata, 2012; Hill et al., 2014a).


As in § 6, we controlled for corpus size effects.
The results of both the small and the large
training setups were very similar to those reported
above both qualitatively and quantitatively. For example,
in the large training setups the multilingual
interpolated models improved over monolingual
non-interpolated models trained on the JL in
61.1% of the cases, compared to 16.7% of the cases
where the monolingual interpolated models achieve
such an improvement. The differences in numerical
Spearman $\rho$ values were up to 0.01 across setups.


8 Conclusions 

In this paper we aimed to establish the importance
of the human JL in lexical semantics research. We
translated and re-scored two prominent datasets,
WS353 and SL999, and demonstrated the impact
of the JL on: (a) human semantic judgments; and
(b) the correlation of monolingual and multilingual
VSM predictions, produced with various training languages,
with human judgments.

33 A simple concatenation of the monolingual vectors is also
an effective combination method of monolingual models, leading
to improvements that are similar to what we report for LI.
However, simple concatenation is effective for the BOW model
only when PPMI normalization is applied to the row counts,
as opposed to LI which is effective regardless of this step. We
therefore focus on LI, the more robust method.

In future work we intend to extend our inquiry to
relations beyond word association and similarity and
to a larger number of TLs and JLs. We further intend
to explore more advanced methods for multilingual
VSM construction. Finally, we would like to go
beyond quantitative analysis and identify qualitative
patterns in our data. Our ultimate goal is to construct
VSMs that directly account for the relations between
their TL(s) and potential human JLs.
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